Fp16 nchw for cudnn-fp16 backend (support GTX 16xx GPUs) #849

Merged
ankan-ban merged 20 commits into LeelaChessZero:master from fp16-nchw on May 13, 2019

Conversation

ankan-ban (Member) commented May 11, 2019

For the cudnn-fp16 backend: supports GPUs that have fast fp16 math but no tensor cores (e.g. GP100 and the GTX 16xx series).
TODO:

1. Figure out a way to check for tensor cores (and select NHWC vs NCHW based on that).
   • Done. Unfortunately this involves string-matching the device name; hopefully a later CUDA release will fix this. (A sketch of the detection logic follows this list.)
2. Actually run on a GTX 16xx card and check performance.
   • Done (see below). Slightly less than a 2x speedup over fp32.
3. Misc cleanup and clang-format.
   • Done.
4. (Optional) Maybe write a fused kernel for the SE layer.
   • Maybe later, or maybe not needed: with the current implementation, raw SE nps is only ~6% lower than with a non-SE net (tested with 256x20 networks).
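As a rough illustration of item 1, here is a minimal sketch (not the actual lc0 code) of what the detection can look like with the CUDA runtime API; the "GTX 16" substring is an assumption for illustration:

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Sketch only: report whether a CUDA device has tensor cores.
bool HasTensorCores(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
  // Tensor cores first appeared with Volta (SM 7.0).
  if (prop.major < 7) return false;
  // GTX 16xx (TU11x) parts report SM 7.5 just like RTX 20xx but have no
  // tensor cores, so the marketing name has to be checked as well.
  // (The exact substring is an assumption, not lc0's actual check.)
  if (std::strstr(prop.name, "GTX 16") != nullptr) return false;
  return true;
}
```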

use bestmove_is_sent_ for Search::IsSearchActive() (LeelaChessZero#502)
- replace all cudaMemcpyAsync used for loading weights with cudaMemcpy, as the source (in CPU memory) could be freed before the async version of the function actually performs the copy.
- minor naming/style changes.
- add a comment explaining what the policy map layer does and how the layout conversion from CHW to HWC works.
- try NCHW layout and the Winograd algorithm for convolutions (same as what we use for fp32); see the sketches after this list.
- it's slower than NHWC/fp16 on GPUs with tensor cores, but should give some speedup on GP100 and TU11x GPUs.
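Two of the points above lend themselves to small sketches. First, the CHW-to-HWC conversion mentioned for the policy map layer is plain index arithmetic; assuming a C x H x W tensor:

```cpp
// Offset of element (c, h, w) in each layout.
inline int ChwIndex(int c, int h, int w, int C, int H, int W) {
  return c * H * W + h * W + w;  // channels outermost
}
inline int HwcIndex(int c, int h, int w, int C, int H, int W) {
  return h * W * C + w * C + c;  // channels innermost
}
```

Second, a minimal sketch, assuming cuDNN 7+, of what configuring an fp16 convolution in NCHW layout involves (the function name and parameters are illustrative, not lc0's actual code):

```cpp
#include <cudnn.h>

// Sketch only: fp16 data and filters in NCHW layout, tensor-op math off.
void SetupFp16NchwConv(cudnnTensorDescriptor_t inDesc,
                       cudnnFilterDescriptor_t filterDesc,
                       cudnnConvolutionDescriptor_t convDesc,
                       int n, int c, int h, int w, int k, int filterSize) {
  cudnnSetTensor4dDescriptor(inDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                             n, c, h, w);
  cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
                             k, c, filterSize, filterSize);
  const int pad = filterSize / 2;
  cudnnSetConvolution2dDescriptor(convDesc, pad, pad, /*stride=*/1, 1,
                                  /*dilation=*/1, 1, CUDNN_CROSS_CORRELATION,
                                  CUDNN_DATA_HALF);
  // Without tensor cores, keep the default math mode; a Winograd forward
  // algorithm (e.g. CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED) can then
  // be selected, as the fp32 path does.
  cudnnSetConvolutionMathType(convDesc, CUDNN_DEFAULT_MATH);
}
```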
ankan-ban added the wip (Work in progress) label on May 11, 2019
ankan-ban (Member, Author) commented May 11, 2019

Some benchmarks on a GTX 1650:

1. fp32:
Benchmark final time 9.77541s calculating 2303.53 nodes per second.

2. fp16 with NHWC (current default):
Benchmark final time 9.35774s calculating 224.627 nodes per second.

3. fp16 with NCHW layout, with the TENSOR_OP_MATH setting enabled:
Benchmark final time 8.7635s calculating 535.517 nodes per second.

4. fp16 with NCHW layout, without the TENSOR_OP_MATH setting enabled:
Benchmark final time 8.67777s calculating 4238.88 nodes per second.

It's surprising that the fp16/NHWC path works at all on the GTX 1650. Maybe cuDNN/cuBLAS is just emulating it, which would explain why it's so slow. Even with the NCHW path, enabling the TENSOR_OP_MATH flag keeps it very slow (again, likely because tensor cores have to be emulated somehow).

The good news is that the fp16/NCHW layout without TENSOR_OP_MATH is almost 2x faster than fp32. A sketch of the flag in question follows.
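For reference, the single switch separating cases 3 and 4 above looks roughly like this; a sketch assuming cuDNN 7+ and cuBLAS 9+ handles that already exist, not the actual lc0 code:

```cpp
#include <cublas_v2.h>
#include <cudnn.h>

// Toggle tensor-op math for convolutions and GEMMs.
void SetTensorOpMath(cudnnConvolutionDescriptor_t convDesc,
                     cublasHandle_t cublas, bool enable) {
  // enable == true  -> case 3 above (slow without real tensor cores)
  // enable == false -> case 4 above (the fast path on GTX 16xx / GP100)
  cudnnSetConvolutionMathType(
      convDesc, enable ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
  cublasSetMathMode(
      cublas, enable ? CUBLAS_TENSOR_OP_MATH : CUBLAS_DEFAULT_MATH);
}
```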

ankan-ban changed the title from "Fp16 nchw" to "Fp16 nchw for cudnn-fp16 backend (support GTX 16xx cards)" on May 12, 2019
ankan-ban changed the title from "Fp16 nchw for cudnn-fp16 backend (support GTX 16xx cards)" to "Fp16 nchw for cudnn-fp16 backend (support GTX 16xx GPUs)" on May 12, 2019
- GP100 (SM 6.0)
- GTX 16xx GPUs (unfortunately they report the same SM 7.5 version as tensor-core Turing parts, so a string compare on the device name is needed)
ankan-ban removed the wip (Work in progress) label on May 12, 2019
ankan-ban requested a review from borg323 on May 12, 2019 07:51
default is auto-select (-1).
Review threads (resolved): src/neural/cuda/common_kernels.cu (outdated), src/neural/cuda/network_cudnn.cc
Use a bool option instead of an int, and use the IsDefault mechanism to check whether the option was forced or not.
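A hypothetical illustration of that commit's logic (the BoolOptions type below is a stand-in, not lc0's actual options API): honor a user-forced value, otherwise auto-select from tensor-core availability.

```cpp
#include <map>
#include <string>

// Stand-in options store with an IsDefault check (hypothetical).
struct BoolOptions {
  std::map<std::string, bool> values;    // options explicitly set by the user
  std::map<std::string, bool> defaults;  // built-in defaults

  bool IsDefault(const std::string& name) const {
    return values.find(name) == values.end();
  }
  bool Get(const std::string& name) const {
    auto it = values.find(name);
    return it != values.end() ? it->second : defaults.at(name);
  }
};

bool ChooseNhwcLayout(const BoolOptions& options, bool deviceHasTensorCores) {
  // If the user explicitly set the (hypothetical) "nhwc" option, honor it.
  if (!options.IsDefault("nhwc")) return options.Get("nhwc");
  // Otherwise auto-select: NHWC only pays off with tensor cores.
  return deviceHasTensorCores;
}
```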
ankan-ban merged commit fa926e5 into LeelaChessZero:master on May 13, 2019
ankan-ban deleted the fp16-nchw branch May 13, 2019 03:46
ankan-ban restored the fp16-nchw branch May 16, 2019 16:58
@@ -0,0 +1,2 @@
 layers.cc
lc0@exe.vcxproj -> C:\Ankan\git\ankan\lc0\build\.\lc0.exe
Member
What is this file? :)

ankan-ban (Member, Author)
Sorry. Likely some intermediate build file that accidentally got committed. Will remove it.

rajb245 (Contributor) commented Sep 3, 2019

Does the merged work apply only to the cards using the GP100, i.e., the Quadro GP100 and the Tesla P100? Can similar techniques apply to the other Pascal chips, in particular GP102 chips like the Titan X (Pascal) and Titan Xp? NVIDIA advertises some level of fp16 acceleration on those, but I don't know enough of the implementation to know the differences.

If there's a path to accelerate performance on GP102 using similar techniques, please let me know and I'll open a feature request issue.

ankan-ban (Member, Author)

Unfortunately no. Other Pascal chips (GP102/GP104/GP106, etc.) don't have fast fp16 math (it runs at a tiny fraction of the fp32 rate). They do support higher-throughput int8 math, but right now lc0 has no support for int8 precision.

jjoshua2 (Contributor) commented Sep 8, 2019 via email

borg323 (Member) commented Sep 8, 2019

Do you know whether using fp16 with CC 6.2 (Jetson TX2) also gives a performance gain?
